COVID-19 analysis and visualization using Plotly
The purpose of this side project was to practice data visualization with Plotly, as I had not used this library in my data analytics studies at JAMK. I focused my visualizations on the disease situation in Finland and Europe in particular. I got the idea from a data project article on the GeeksforGeeks website. Three CSV files were used in the project.
covid.csv – This dataset contains the columns Country/Region, Continent, Population, TotalCases, NewCases, TotalDeaths, NewDeaths, TotalRecovered, NewRecovered, ActiveCases, Serious,Critical (a single column), Tot Cases/1M pop, Deaths/1M pop, TotalTests, Tests/1M pop, WHO Region, iso_alpha.
covid_grouped.csv – This dataset contains the columns Date (from 2020-01-22 to 2020-07-27), Country/Region, Confirmed, Deaths, Recovered, Active, New cases, New deaths, New recovered, WHO Region, iso_alpha.
coviddeath.csv – This dataset contains the number of COVID-19 deaths by state, age group, and the conditions recorded alongside the deaths.
Plotly visualizations are not rendered on GitHub, but you can view them here by entering the link to my repository in the search box :)
# Data handling
import pandas as pd
# Plotly
import plotly.express as px
import plotly.graph_objs as go
import plotly.io as pio
import plotly.offline as py
# Matplotlib (used for the word clouds)
import matplotlib.pyplot as plt
# Initializing Plotly for notebook rendering
py.init_notebook_mode(connected=True)
pio.renderers.default = 'plotly_mimetype+notebook'
# Importing Dataset1
dataset1 = pd.read_csv(r"C:\Users\taavi\OneDrive\Tiedostot\Jamk-opinnot\covid19-analysis-plotly\covid.csv")
dataset1.head() # returns first 5 rows
| | Country/Region | Continent | Population | TotalCases | NewCases | TotalDeaths | NewDeaths | TotalRecovered | NewRecovered | ActiveCases | Serious,Critical | Tot Cases/1M pop | Deaths/1M pop | TotalTests | Tests/1M pop | WHO Region | iso_alpha |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | USA | North America | 3.311981e+08 | 5032179 | NaN | 162804.0 | NaN | 2576668.0 | NaN | 2292707.0 | 18296.0 | 15194.0 | 492.0 | 63139605.0 | 190640.0 | Americas | USA |
| 1 | Brazil | South America | 2.127107e+08 | 2917562 | NaN | 98644.0 | NaN | 2047660.0 | NaN | 771258.0 | 8318.0 | 13716.0 | 464.0 | 13206188.0 | 62085.0 | Americas | BRA |
| 2 | India | Asia | 1.381345e+09 | 2025409 | NaN | 41638.0 | NaN | 1377384.0 | NaN | 606387.0 | 8944.0 | 1466.0 | 30.0 | 22149351.0 | 16035.0 | South-EastAsia | IND |
| 3 | Russia | Europe | 1.459409e+08 | 871894 | NaN | 14606.0 | NaN | 676357.0 | NaN | 180931.0 | 2300.0 | 5974.0 | 100.0 | 29716907.0 | 203623.0 | Europe | RUS |
| 4 | South Africa | Africa | 5.938157e+07 | 538184 | NaN | 9604.0 | NaN | 387316.0 | NaN | 141264.0 | 539.0 | 9063.0 | 162.0 | 3149807.0 | 53044.0 | Africa | ZAF |
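The absolute Windows path used above only resolves on the author's machine. A minimal sketch of a more portable loader follows, assuming the CSV files sit together in one folder. The helper name `load_dataset` and its `data_dir` parameter are hypothetical and not part of the original notebook; a temporary folder with a two-row stand-in file is used so the example runs anywhere.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_dataset(filename, data_dir="."):
    """Read a CSV from data_dir (hypothetical helper, not in the original notebook)."""
    return pd.read_csv(Path(data_dir) / filename)


# Demo with a tiny stand-in file; the real covid.csv has 209 rows and 17 columns.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "covid.csv").write_text(
        "Country/Region,Continent,TotalCases\n"
        "USA,North America,5032179\n"
        "Brazil,South America,2917562\n"
    )
    demo = load_dataset("covid.csv", data_dir=tmp)

print(demo.shape)
```

In the notebook itself, the call would presumably look like `load_dataset("covid.csv")` when the notebook is started from the repository folder.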
# Returns a tuple (rows, columns)
print(dataset1.shape)
# Returns the total number of elements (rows * columns)
print(dataset1.size)
# Prints a concise summary of dataset1
dataset1.info()
(209, 17)
3553
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 209 entries, 0 to 208
Data columns (total 17 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   Country/Region    209 non-null    object
 1   Continent         208 non-null    object
 2   Population        208 non-null    float64
 3   TotalCases        209 non-null    int64
 4   NewCases          4 non-null      float64
 5   TotalDeaths       188 non-null    float64
 6   NewDeaths         3 non-null      float64
 7   TotalRecovered    205 non-null    float64
 8   NewRecovered      3 non-null      float64
 9   ActiveCases       205 non-null    float64
 10  Serious,Critical  122 non-null    float64
 11  Tot Cases/1M pop  208 non-null    float64
 12  Deaths/1M pop     187 non-null    float64
 13  TotalTests        191 non-null    float64
 14  Tests/1M pop      191 non-null    float64
 15  WHO Region        184 non-null    object
 16  iso_alpha         209 non-null    object
dtypes: float64(12), int64(1), object(4)
memory usage: 27.9+ KB
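The info() summary above lists non-null counts per column. The complementary view, missing values per column, can be sketched as follows. The four rows are copied from the dataset1.sample(5) output shown further down and only stand in for the full dataset1; the names `df`, `missing` and `share` are illustrative.

```python
import numpy as np
import pandas as pd

# Four rows taken from the sample output in this document (stand-in for dataset1).
df = pd.DataFrame({
    "Country/Region": ["Brazil", "Bermuda", "Indonesia", "French Guiana"],
    "TotalCases": [2917562, 157, 118753, 8127],
    "Serious,Critical": [8318.0, np.nan, np.nan, 23.0],
    "WHO Region": ["Americas", "Americas", "South-EastAsia", np.nan],
})

missing = df.isna().sum()   # number of missing values per column
share = df.isna().mean()    # fraction of missing values per column

print(missing)
print(share)
```

Columns with a very high share of missing values are natural candidates for dropping.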
# Importing Dataset2
dataset2 = pd.read_csv(r"C:\Users\taavi\OneDrive\Tiedostot\Jamk-opinnot\covid19-analysis-plotly\covid_grouped.csv")
dataset2.head() # return first 5 rows of dataset2
| | Date | Country/Region | Confirmed | Deaths | Recovered | Active | New cases | New deaths | New recovered | WHO Region | iso_alpha |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2020-01-22 | Afghanistan | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Eastern Mediterranean | AFG |
| 1 | 2020-01-22 | Albania | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Europe | ALB |
| 2 | 2020-01-22 | Algeria | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Africa | DZA |
| 3 | 2020-01-22 | Andorra | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Europe | AND |
| 4 | 2020-01-22 | Angola | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Africa | AGO |
# Returns a tuple (rows, columns)
print(dataset2.shape)
# Returns the total number of elements (rows * columns)
print(dataset2.size)
# Prints a concise summary of dataset2
dataset2.info()
(35156, 11)
386716
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35156 entries, 0 to 35155
Data columns (total 11 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   Date            35156 non-null  object
 1   Country/Region  35156 non-null  object
 2   Confirmed       35156 non-null  int64
 3   Deaths          35156 non-null  int64
 4   Recovered       35156 non-null  int64
 5   Active          35156 non-null  int64
 6   New cases       35156 non-null  int64
 7   New deaths      35156 non-null  int64
 8   New recovered   35156 non-null  int64
 9   WHO Region      35156 non-null  object
 10  iso_alpha       35156 non-null  object
dtypes: int64(7), object(4)
memory usage: 3.0+ MB
# Column labels of dataset1
dataset1.columns
Index(['Country/Region', 'Continent', 'Population', 'TotalCases', 'NewCases',
'TotalDeaths', 'NewDeaths', 'TotalRecovered', 'NewRecovered',
'ActiveCases', 'Serious,Critical', 'Tot Cases/1M pop', 'Deaths/1M pop',
'TotalTests', 'Tests/1M pop', 'WHO Region', 'iso_alpha'],
dtype='object')
# Drop the NewCases, NewDeaths and NewRecovered columns from dataset1
dataset1.drop(['NewCases', 'NewDeaths', 'NewRecovered'],
              axis=1, inplace=True)
# Select random set of values from dataset1
dataset1.sample(5)
| | Country/Region | Continent | Population | TotalCases | TotalDeaths | TotalRecovered | ActiveCases | Serious,Critical | Tot Cases/1M pop | Deaths/1M pop | TotalTests | Tests/1M pop | WHO Region | iso_alpha |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 180 | Bermuda | North America | 62254.0 | 157 | 9.0 | 144.0 | 4.0 | NaN | 2522.0 | 145.0 | 26352.0 | 423298.0 | Americas | BMU |
| 80 | Senegal | Africa | 16783877.0 | 10715 | 223.0 | 7101.0 | 3391.0 | 33.0 | 638.0 | 13.0 | 114761.0 | 6838.0 | Africa | SEN |
| 1 | Brazil | South America | 212710692.0 | 2917562 | 98644.0 | 2047660.0 | 771258.0 | 8318.0 | 13716.0 | 464.0 | 13206188.0 | 62085.0 | Americas | BRA |
| 22 | Indonesia | Asia | 273808365.0 | 118753 | 5521.0 | 75645.0 | 37587.0 | NaN | 434.0 | 20.0 | 1633156.0 | 5965.0 | South-EastAsia | IDN |
| 84 | French Guiana | South America | 299385.0 | 8127 | 47.0 | 7240.0 | 840.0 | 23.0 | 27146.0 | 157.0 | 41412.0 | 138324.0 | NaN | GUF |
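An equivalent way to drop the three mostly empty columns is the `columns=` keyword of `DataFrame.drop`, which returns a new DataFrame and leaves the original untouched. A minimal sketch on two rows from the head() output; the names `df` and `trimmed` are illustrative.

```python
import numpy as np
import pandas as pd

# Two rows from the dataset1.head() output (stand-in for the full dataset).
df = pd.DataFrame({
    "Country/Region": ["USA", "Brazil"],
    "TotalCases": [5032179, 2917562],
    "NewCases": [np.nan, np.nan],
    "NewDeaths": [np.nan, np.nan],
    "NewRecovered": [np.nan, np.nan],
})

# drop() without inplace=True returns a copy; df keeps all five columns.
trimmed = df.drop(columns=["NewCases", "NewDeaths", "NewRecovered"])

print(trimmed.columns.tolist())
print(df.shape)
```

Keeping the original frame around can be handy in a notebook, because re-running a cell that drops columns in place raises a KeyError the second time.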
# Total cases of the first 15 countries, colored by total cases
px.bar(dataset1.head(15), x='Country/Region', y='TotalCases',
       color='TotalCases', height=500,
       hover_data=['Country/Region', 'Continent'])
# Total cases of the first 15 countries, colored by total deaths
px.bar(dataset1.head(15), x='Country/Region', y='TotalCases',
       color='TotalDeaths', height=500,
       hover_data=['Country/Region', 'Continent'])
# Total tests of the first 15 countries (horizontal bars)
px.bar(dataset1.head(15), x='TotalTests', y='Country/Region',
       color='TotalTests', orientation='h', height=500,
       hover_data=['Country/Region', 'Continent'])
# Total tests of the first 15 countries, stacked by continent
px.bar(dataset1.head(15), x='TotalTests', y='Continent',
       color='TotalTests', orientation='h', height=500,
       hover_data=['Country/Region', 'Continent'])
# Total cases by continent (first 57 rows, logarithmic y-axis)
px.scatter(dataset1.head(57), x='Continent', y='TotalCases',
           hover_data=['Country/Region', 'Continent'],
           color='TotalCases', size='TotalCases', size_max=80, log_y=True)
# Total tests by continent (first 50 rows, logarithmic y-axis)
px.scatter(dataset1.head(50), x='Continent', y='TotalTests',
           hover_data=['Country/Region', 'Continent'],
           color='TotalTests', size='TotalTests', size_max=80, log_y=True)
# Total cases by country (first 100 rows)
px.scatter(dataset1.head(100), x='Country/Region', y='TotalCases',
           hover_data=['Country/Region', 'Continent'],
           color='TotalCases', size='TotalCases', size_max=80)
# Total cases by country (first 30 rows, logarithmic y-axis)
px.scatter(dataset1.head(30), x='Country/Region', y='TotalCases',
           hover_data=['Country/Region', 'Continent'],
           color='Country/Region', size='TotalCases', size_max=80, log_y=True)
# Total deaths by country (first 10 rows)
px.scatter(dataset1.head(10), x='Country/Region', y='TotalDeaths',
           hover_data=['Country/Region', 'Continent'],
           color='Country/Region', size='TotalDeaths', size_max=80)
# Tests per million inhabitants by country (first 30 rows)
px.scatter(dataset1.head(30), x='Country/Region', y='Tests/1M pop',
           hover_data=['Country/Region', 'Continent'],
           color='Country/Region', size='Tests/1M pop', size_max=80)
# Confirmed cases per date for all countries
px.bar(dataset2, x="Date", y="Confirmed", color="Confirmed",
       hover_data=["Confirmed", "Date", "Country/Region"], height=400)
# Same chart with a logarithmic y-axis
px.bar(dataset2, x="Date", y="Confirmed", color="Confirmed",
       hover_data=["Confirmed", "Date", "Country/Region"],
       log_y=True, height=400)
# Deaths per date for all countries (linear y-axis)
px.bar(dataset2, x="Date", y="Deaths", color="Deaths",
       hover_data=["Confirmed", "Date", "Country/Region"],
       log_y=False, height=400)
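The bar charts above pass every row of dataset2 to Plotly, so each date carries one bar segment per country. If only the overall trend is of interest, the rows could first be summed per date. The sketch below also converts Date to a datetime type, since info() reports it as object. The values are placeholders, not real counts, and `toy` and `daily_total` are illustrative names.

```python
import pandas as pd

# Placeholder values (not real counts) with the same columns as dataset2.
toy = pd.DataFrame({
    "Date": ["2020-03-01", "2020-03-01", "2020-03-02", "2020-03-02"],
    "Country/Region": ["A", "B", "A", "B"],
    "Confirmed": [1, 2, 3, 5],
})

# Parse the date strings so Plotly treats the x-axis as a time axis.
toy["Date"] = pd.to_datetime(toy["Date"])

# One row per date with the sum over all countries.
daily_total = toy.groupby("Date", as_index=False)["Confirmed"].sum()

print(daily_total)
```

The aggregated frame can then be plotted in the same way as before, for example with `px.bar(daily_total, x="Date", y="Confirmed")`.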
data_FIN = dataset2.loc[dataset2['Country/Region'] == 'Finland']
data_FIN.head()
| | Date | Country/Region | Confirmed | Deaths | Recovered | Active | New cases | New deaths | New recovered | WHO Region | iso_alpha |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 60 | 2020-01-22 | Finland | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Europe | FIN |
| 247 | 2020-01-23 | Finland | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Europe | FIN |
| 434 | 2020-01-24 | Finland | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Europe | FIN |
| 621 | 2020-01-25 | Finland | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Europe | FIN |
| 808 | 2020-01-26 | Finland | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Europe | FIN |
px.bar(data_FIN, x="Date", y="Confirmed", color="Confirmed", height=400)
px.bar(data_FIN, x="Date", y="Recovered", color="Recovered", height=400)
px.line(data_FIN, x="Date", y="Recovered", height=400)
px.line(data_FIN, x="Date", y="Deaths", height=400)
px.line(data_FIN, x="Date", y="Confirmed", height=400)
px.line(data_FIN, x="Date", y="New cases", height=400)
px.bar(data_FIN, x="Date", y="New cases", height=400)
px.scatter(data_FIN, x="Confirmed", y="Deaths", height=400)
# Rows of the WHO Europe region
data_EU = dataset2.loc[dataset2['WHO Region'] == 'Europe']
# Animated map of confirmed cases, one frame per date
fig = px.choropleth(data_EU,
                    locations='iso_alpha',
                    color="Confirmed",
                    hover_name="Country/Region",
                    color_continuous_scale="Reds",
                    animation_frame="Date",
                    range_color=(0, data_EU["Confirmed"].max()))
fig.update_geos(
    projection_scale=2,
    scope="europe"
)
# Speed up the animation: 50 ms per frame, no transition between frames
fig.layout.updatemenus[0].buttons[0].args[1]["frame"]["duration"] = 50
fig.layout.updatemenus[0].buttons[0].args[1]["transition"]["duration"] = 0
fig.show()
# Animated map of deaths, one frame per date
fig = px.choropleth(data_EU,
                    locations='iso_alpha',
                    color="Deaths",
                    hover_name="Country/Region",
                    color_continuous_scale="Reds",
                    animation_frame="Date",
                    range_color=(0, data_EU["Deaths"].max()))
fig.update_geos(
    projection_scale=2,
    scope="europe"
)
# Speed up the animation: 50 ms per frame, no transition between frames
fig.layout.updatemenus[0].buttons[0].args[1]["frame"]["duration"] = 50
fig.layout.updatemenus[0].buttons[0].args[1]["transition"]["duration"] = 0
fig.show()
# Animated bar chart of confirmed cases per European country.
# range_y fixes the y-axis across frames; the original range_color had no
# effect here because the bars are colored by a categorical column.
fig2 = px.bar(data_EU, x="Country/Region", y="Confirmed", color="Country/Region",
              animation_frame="Date", hover_name="Country/Region",
              range_y=(0, data_EU["Confirmed"].max()))
fig2.layout.updatemenus[0].buttons[0].args[1]["frame"]["duration"] = 100
fig2.layout.updatemenus[0].buttons[0].args[1]["transition"]["duration"] = 0
fig2.update_layout(
    height=900,
    width=1200,
    xaxis=dict(tickangle=-45)
)
fig2.show()
# Importing Dataset3
dataset3 = pd.read_csv(r"C:\Users\taavi\OneDrive\Tiedostot\Jamk-opinnot\covid19-analysis-plotly\coviddeath.csv")
dataset3.head()
| | Data as of | Start Week | End Week | State | Condition Group | Condition | ICD10_codes | Age Group | Number of COVID-19 Deaths | Flag |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 08/30/2020 | 02/01/2020 | 08/29/2020 | US | Respiratory diseases | Influenza and pneumonia | J09-J18 | 0-24 | 122.0 | NaN |
| 1 | 08/30/2020 | 02/01/2020 | 08/29/2020 | US | Respiratory diseases | Influenza and pneumonia | J09-J18 | 25-34 | 596.0 | NaN |
| 2 | 08/30/2020 | 02/01/2020 | 08/29/2020 | US | Respiratory diseases | Influenza and pneumonia | J09-J18 | 35-44 | 1521.0 | NaN |
| 3 | 08/30/2020 | 02/01/2020 | 08/29/2020 | US | Respiratory diseases | Influenza and pneumonia | J09-J18 | 45-54 | 4186.0 | NaN |
| 4 | 08/30/2020 | 02/01/2020 | 08/29/2020 | US | Respiratory diseases | Influenza and pneumonia | J09-J18 | 55-64 | 10014.0 | NaN |
dataset3.tail()
| | Data as of | Start Week | End Week | State | Condition Group | Condition | ICD10_codes | Age Group | Number of COVID-19 Deaths | Flag |
|---|---|---|---|---|---|---|---|---|---|---|
| 12255 | 08/30/2020 | 02/01/2020 | 08/29/2020 | YC | Coronavirus Disease 2019 | COVID-19 | U071 | 65-74 | 5024.0 | NaN |
| 12256 | 08/30/2020 | 02/01/2020 | 08/29/2020 | YC | Coronavirus Disease 2019 | COVID-19 | U071 | 75-84 | 5381.0 | NaN |
| 12257 | 08/30/2020 | 02/01/2020 | 08/29/2020 | YC | Coronavirus Disease 2019 | COVID-19 | U071 | 85+ | 4841.0 | NaN |
| 12258 | 08/30/2020 | 02/01/2020 | 08/29/2020 | YC | Coronavirus Disease 2019 | COVID-19 | U071 | Not stated | NaN | Counts less than 10 suppressed. |
| 12259 | 08/30/2020 | 02/01/2020 | 08/29/2020 | YC | Coronavirus Disease 2019 | COVID-19 | U071 | All ages | 20628.0 | NaN |
dataset3.groupby(["Condition"]).count()
| Condition | Data as of | Start Week | End Week | State | Condition Group | ICD10_codes | Age Group | Number of COVID-19 Deaths | Flag |
|---|---|---|---|---|---|---|---|---|---|
| Adult respiratory distress syndrome | 540 | 540 | 540 | 540 | 540 | 540 | 540 | 272 | 268 |
| All other conditions and causes (residual) | 540 | 540 | 540 | 540 | 540 | 540 | 540 | 363 | 177 |
| Alzheimer disease | 530 | 530 | 530 | 530 | 530 | 530 | 530 | 144 | 386 |
| COVID-19 | 540 | 540 | 540 | 540 | 540 | 540 | 540 | 377 | 163 |
| Cardiac arrest | 520 | 520 | 520 | 520 | 520 | 520 | 520 | 219 | 301 |
| Cardiac arrhythmia | 540 | 540 | 540 | 540 | 540 | 540 | 540 | 192 | 348 |
| Cerebrovascular diseases | 530 | 530 | 530 | 530 | 530 | 530 | 530 | 187 | 343 |
| Chronic lower respiratory diseases | 540 | 540 | 540 | 540 | 540 | 540 | 540 | 229 | 311 |
| Diabetes | 540 | 540 | 540 | 540 | 540 | 540 | 540 | 276 | 264 |
| Heart failure | 540 | 540 | 540 | 540 | 540 | 540 | 540 | 204 | 336 |
| Hypertensive diseases | 540 | 540 | 540 | 540 | 540 | 540 | 540 | 264 | 276 |
| Influenza and pneumonia | 540 | 540 | 540 | 540 | 540 | 540 | 540 | 331 | 209 |
| Intentional and unintentional injury, poisoning, and other adverse events | 520 | 520 | 520 | 520 | 520 | 520 | 520 | 188 | 332 |
| Ischemic heart disease | 540 | 540 | 540 | 540 | 540 | 540 | 540 | 224 | 316 |
| Malignant neoplasms | 540 | 540 | 540 | 540 | 540 | 540 | 540 | 198 | 342 |
| Obesity | 530 | 530 | 530 | 530 | 530 | 530 | 530 | 182 | 348 |
| Other diseases of the circulatory system | 530 | 530 | 530 | 530 | 530 | 530 | 530 | 213 | 317 |
| Other diseases of the respiratory system | 540 | 540 | 540 | 540 | 540 | 540 | 540 | 188 | 352 |
| Renal failure | 540 | 540 | 540 | 540 | 540 | 540 | 540 | 238 | 302 |
| Respiratory arrest | 480 | 480 | 480 | 480 | 480 | 480 | 480 | 111 | 369 |
| Respiratory failure | 540 | 540 | 540 | 540 | 540 | 540 | 540 | 320 | 220 |
| Sepsis | 530 | 530 | 530 | 530 | 530 | 530 | 530 | 243 | 287 |
| Vascular and unspecified dementia | 530 | 530 | 530 | 530 | 530 | 530 | 530 | 191 | 339 |
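`count()` in the table above only reports how many non-null entries each condition has per column. To compare conditions by deaths, the death counts themselves could be summed. The sketch below uses the ten rows shown in the head() and tail() outputs. It assumes that "All ages" rows are totals over the other age groups and excludes them to avoid double counting; that assumption should be checked against the full file. `rows`, `by_age` and `deaths` are illustrative names.

```python
import numpy as np
import pandas as pd

# The ten rows shown in the dataset3.head() and dataset3.tail() outputs.
rows = pd.DataFrame({
    "State": ["US"] * 5 + ["YC"] * 5,
    "Condition": ["Influenza and pneumonia"] * 5 + ["COVID-19"] * 5,
    "Age Group": ["0-24", "25-34", "35-44", "45-54", "55-64",
                  "65-74", "75-84", "85+", "Not stated", "All ages"],
    "Number of COVID-19 Deaths": [122.0, 596.0, 1521.0, 4186.0, 10014.0,
                                  5024.0, 5381.0, 4841.0, np.nan, 20628.0],
})

# Assumption: "All ages" is a total row, so it is left out before summing.
by_age = rows[rows["Age Group"] != "All ages"]

# Sum of deaths per state and condition; suppressed (NaN) counts are skipped.
deaths = by_age.groupby(["State", "Condition"])["Number of COVID-19 Deaths"].sum()

print(deaths)
```

The resulting Series could be turned back into a frame with `deaths.reset_index()` and passed to `px.bar` like the other datasets.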
# Import WordCloud
from wordcloud import WordCloud
# Convert the "Condition" column to a list and join it into one single string
sentences = dataset3["Condition"].tolist()
sentences_as_a_string = ' '.join(sentences)
# Generate a word cloud from the string and display it
plt.figure(figsize=(40, 40))
plt.imshow(WordCloud().generate(sentences_as_a_string))
# Convert the "Condition Group" column to a list
column2_tolist = dataset3["Condition Group"].tolist()
# Convert the list to one single string
column_to_string = " ".join(column2_tolist)
# Generate a word cloud from the string and display it
plt.figure(figsize=(20,20))
plt.imshow(WordCloud().generate(column_to_string))
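Joining the column into one string means WordCloud tokenizes the text itself, so a multi-word name such as "Influenza and pneumonia" is broken up into individual words. If whole condition names should appear in the cloud instead, the occurrences per name can be counted first. A minimal sketch on a stand-in Series built from the ten rows shown in the head() and tail() outputs; `conditions` and `freqs` are illustrative names.

```python
import pandas as pd

# Stand-in for dataset3["Condition"], built from the head() and tail() rows.
conditions = pd.Series(
    ["Influenza and pneumonia"] * 5 + ["COVID-19"] * 5,
    name="Condition",
)

# Count how often each full condition name occurs.
freqs = conditions.value_counts().to_dict()
print(freqs)

# With the wordcloud package installed, the dict can be passed on directly:
# plt.imshow(WordCloud().generate_from_frequencies(freqs))
```

`generate_from_frequencies` sizes each key by its count, so every condition stays intact as one label.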